1.  Summary


Synthetic DNA is proposed for use as an information storage medium and as a three-
dimensional structural material in nanotechnology. Such applications of DNA require
collections of oligonucleotides that do not crosshybridize. Thus, there is a need to
efficiently create large collections of non-crosshybridizing oligonucleotides. The process
of designing them has come to be known as DNA word, or DNA code, design. In this
research, the DNA code generating software, SynDCode, was developed, refined to
include pseudoknot secondary structure and simulate hybridization assays, and applied to
DNA hybridization algorithms.

2.  Introduction 

SynDCode  provides  the  means  to  create  collections  of  synthetic  DNA  strands  (i.e.,  a 
DNA  code)  with  controlled  properties  such  as  resistance  to  crosshybridization.  The  user 
has  the  ability  to  verify  the  properties  of  an  existing  DNA  code,  expand  a  given  DNA 
code,  or  create  an  entirely  new  DNA  code.  The  models  built  into  SynDCode  allow  for  the 
specification  of  thermodynamic  properties  of  the  generated  DNA  code  and  for  collections 
of concatenated combinations of strands taken from the generated code. SynDCode can
also be used to construct DNA codes that do not disrupt functional oligonucleotides
external to the code, e.g., priming sites, and to construct codes that contain
important motifs, e.g., restriction sites.

SynDCode captures the key aspects of the nearest neighbor thermodynamic model for
hybridized DNA duplexes in a computationally efficient way. All pairwise strand
computations have complexity $O(n^2)$, where $n$ is the length of the strand. This is a
significant improvement, as other DNA code software tools based on nearest neighbor
thermodynamics have complexity $O(n^3)$.

3.  Methods,  Assumptions,  Procedures 

3.1  Complemented  DNA  Codes

Single strands of DNA are, abstractly, (A, C, G, T)-quaternary sequences, with the four
letters denoting the respective nucleotides. In this paper, when we write DNA
molecules without indicating the direction, it is assumed that the direction is 5' → 3'. To
obtain the reverse complement of a strand of DNA, first reverse the order of the letters
and then substitute each letter with its canonical Watson-Crick complement. For
example, the reverse complement of AACGTG is CACGTT. If $x$ = AACGTG, then
we let $\bar{x}$ denote its reverse complement CACGTT. A complemented DNA code $C$ is
a collection of Watson-Crick pairs of complementary DNA sequences, i.e.,
$C = \{\{x_i, \bar{x}_i\}\}$. The greatest energy of duplex formation is obtained when two sequences
are reverse complements of one another and the DNA duplex formed is a Watson-Crick
(WC) duplex. However, there are many instances when the formation of non-WC
duplexes is energetically favorable. In this paper, a non-WC duplex is referred to as a
crosshybridized (CH) duplex. A basic goal of DNA code design is to be assured that a
fixed temperature can be found that is well above the melting point of all CH duplexes and
well below the melting point of all WC duplexes that can form from strands in the code. It is
also desirable for all WC duplexes to have melting points in a narrow range. A
complemented DNA code with this property is said to have high binding specificity. High
binding specificity is akin to a high signal-to-noise ratio.
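
The reverse-complement operation defined above can be illustrated with a short Python
sketch. The function name `reverse_complement` is chosen for this example only and is
not taken from SynDCode.

```python
# Watson-Crick complement of each base (A<->T, C<->G).
WC_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(strand: str) -> str:
    """Reverse the 5'->3' strand, then complement each letter."""
    return "".join(WC_COMPLEMENT[base] for base in reversed(strand.upper()))


if __name__ == "__main__":
    # Example taken from the text above.
    print(reverse_complement("AACGTG"))  # CACGTT
```

Applying the function twice returns the original strand, which is why a complemented
code can be stored as one strand per WC pair.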

3.2  SynDCode 

SynDCode provides the means to create collections of synthetic DNA strands with
controlled properties such as resistance to crosshybridization. The user has the ability to
verify the properties of an existing DNA code, expand a given DNA code, or create an
entirely new DNA code. The models built into SynDCode allow for the specification of
thermodynamic properties of the generated DNA code and for collections of
concatenated combinations of strands taken from the generated code. SynDCode can be
used to construct DNA codes that do not adversely interact with functional
oligonucleotides external to the code, e.g., priming sites, and it can construct codes that
contain important motifs, e.g., restriction sites.

The mathematical models built into SynDCode come from [1]. All pairwise
strand computations, including the computation of the bound on the free energy of
formation, $\Delta G(x : y)$, of the $x : y$ duplex, have complexity $O(n^2)$. This is
significant, as other DNA code software tools, including SLSDesigner [2], use
$\Delta G(x : y)$ approximation algorithms with complexity $O(n^3)$. These more
computationally complex algorithms give more accurate pairwise $\Delta G(x : y)$
computations, but pairwise accuracy is not necessarily the most important consideration
when the primary objective is maximizing the size of a code for global thresholds.

It should be noted that distance functions (like ours in SynDCode) which are based on the
theory of t-gap block isomorphic subsequences are generalizations of the classical
Levenshtein insertion-deletion distance and are very different from the h-distance
frequently discussed in the DNA computing literature. The h-distance is a simple
translation of the Hamming distance which, when applied to modeling potential
secondary structures between two DNA strands of length $n$, takes into account at most
$2n$ possible secondary structures between them. In stark contrast, the mathematical
methods embedded in SynDCode give information about all potential secondary
structures between two DNA strands and provide this information in a computationally
efficient way.

3.2.1  Weighted  Stem  Measure  of  a  DNA  Duplex

The notation from previous reports [1] and [8] is used throughout. A natural
simplification for formulating binding specificity is to base it upon the maximum number
of WC (inter-strand, non-covalent hydrogen) base pair bonds between complementary
letter pairs which may be formed between two oppositely directed strands. Let $x : \bar{y}$
denote the CH duplex formed between $x$ and $\bar{y}$, where $\bar{y}$ is the WC complement of $y$.
Then an upper bound on this maximum number of base pair bonds that can form in the
$x : \bar{y}$ duplex is $\psi_\Omega^1(x, y)$, the maximum length of a common subsequence to $x$ and
$y$. Writing $d$ for this bound, this doesn't mean that $x$ and $\bar{y}$ will form $d$ base pair
bonds in a hybridization assay; it just says they could never form more than $d$ base pair
bonds. In [1], this measure was denoted by $\psi_\Omega^1(x, y)$ where $\Omega$ is the constant function 1.
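
The bound described above is the length of a longest common subsequence, which can be
computed with the standard quadratic-time dynamic program. The sketch below is a
generic textbook implementation, not SynDCode source; the function name is an assumption
made for this example.

```python
def max_common_subsequence_length(x: str, y: str) -> int:
    """Length of a longest common subsequence of x and y.

    Runs in O(len(x) * len(y)) time and keeps only two rows of the table.
    """
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, start=1):
            if a == b:
                # Extend the best alignment of the two shorter prefixes.
                cur.append(prev[j - 1] + 1)
            else:
                # Drop the last letter of x or of y, whichever is better.
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]
```

For a strand compared with itself the value is the strand length, which corresponds to
the fully bonded WC duplex; for two strands with no letter in common it is 0.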

If the binding specificity were solely dependent on the number of base pair bonds, then
DNA codes constructed by using $\psi_\Omega^1(x, y)$ as the constraint could be used in
hybridization assays with assured high binding specificity. However, the state-of-the-art
model of DNA duplex thermodynamics is the Nearest Neighbor (NN) model. In the NN
model, thermodynamic (e.g., free energy) values are assigned to loops rather than base
pairs. Consider two oppositely directed DNA strands

$$x = 5'\, x_1, x_2, \ldots, x_i, \ldots, x_n\, 3'$$
$$\bar{y} = 3'\, \bar{y}_1, \bar{y}_2, \ldots, \bar{y}_j, \ldots, \bar{y}_n\, 5'$$

where $\bar{y}_j$ denotes the complement to base $y_j$. A secondary structure of the DNA
duplex $x : \bar{y}$ is a sequence of pairs of complementary bases $((x_{i_r}, \bar{y}_{j_r}))$ where
$(x_{i_r})$ and $(y_{j_r})$ are subsequences of $x$ and $y$ respectively. Clearly the duplex
$x : \bar{y}$ can have many secondary structures. An important issue is to understand which
secondary structure is the most energetically favorable. The duplex $x : \bar{y}$ can have a
t-stem if and only if there are strings $\alpha, \beta \subseteq [n]$ with $\alpha = [i, i+t-1]$ and
$\beta = [j, j+t-1]$ such that $x_\alpha = y_\beta$, where $y = 5'\, y_1, y_2, \ldots, y_n\, 3'$. A maximal
t-stem is one that is not properly contained in another larger t'-stem. Every maximal
t-stem contains $\max(t - j + 1, 0)$ j-stems. In [1] an efficient means of computing
$\psi_\Omega^t(x, y)$, the maximum weighted sum of the common t-stems that can occur taken
over all possible secondary structures for the $x : \bar{y}$ duplex, is given. Its method of
computation is the basis of SynDCode.

Suppose $\Omega = F$ assigns each 2-stem its Nearest Neighbor thermodynamic free energy
value as given in Figure 1. The $(i, j)$th entry of Table 1 is the value of $F(i, j)$. For
example, $F(C, T) = 1.28$. This function is defined by setting $F(i, j)$ to the free energies
for 2-stems given in [10]. So $-F(C, T)$ denotes the free energy associated with the
2-stem CT.

Figure 1: Thermodynamic weight of 2-stems and mismatched
stacked pairs. Mismatched but stabilizing stacked pairs are in
green, perfect 2-stems are in orange.


Let $x : \bar{y}$ be a CH duplex and let $\Delta G(x : \bar{y})$ and $\Delta G(x : \bar{x})$ be the NN
computation of the free energy of the $x : \bar{y}$ CH and $x : \bar{x}$ WC duplexes respectively.
In [1], it is shown that $-\psi_F^2(x, y) \le \Delta G(x : \bar{y})$ and $-\psi_F^2(x, x) = \Delta G(x : \bar{x})$.
SynDCode generates blueprints for WC pairs in complemented DNA codes for which
$\psi_\Omega^t(x, y)$ is the basic measure of compliance.
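
To make the weighted 2-stem measure concrete, the following Python sketch computes a
maximum weighted sum of common 2-stems with a quadratic-time dynamic program. It is an
illustration under stated assumptions, not the algorithm of [1]: a 2-stem is taken to be
two adjacent positions that are matched in both strands, and the weights in `F` are
assumed to be the magnitudes of the 37 C stacked-pair free energies listed in Figure 9,
indexed by the 5'->3' dinucleotide. The names `F` and `psi_2F` are chosen for this example.

```python
# Assumed weights (kcal/mol, magnitudes). Complementary dinucleotides such as
# AA and TT describe the same stacked pair and share one value.
F = {
    "AA": 1.00, "TT": 1.00, "AT": 0.88, "TA": 0.58,
    "CA": 1.45, "TG": 1.45, "GT": 1.44, "AC": 1.44,
    "CT": 1.28, "AG": 1.28, "GA": 1.30, "TC": 1.30,
    "CG": 2.17, "GC": 2.24, "GG": 1.84, "CC": 1.84,
}

NEG_INF = float("-inf")


def psi_2F(x: str, y: str) -> float:
    """Maximum total weight of common 2-stems over all common subsequences."""
    n, m = len(x), len(y)
    # best[i][j]: best weight using only x[:i] and y[:j].
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    # end[i][j]: best weight when x[i-1] is matched with y[j-1], else -inf.
    end = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                # Either start a new stem after any earlier structure ...
                end[i][j] = best[i - 1][j - 1]
                # ... or stack on the adjacent match and gain one 2-stem.
                if i > 1 and end[i - 1][j - 1] > NEG_INF:
                    stacked = end[i - 1][j - 1] + F[x[i - 2 : i]]
                    end[i][j] = max(end[i][j], stacked)
            best[i][j] = max(best[i - 1][j], best[i][j - 1], end[i][j])
    return best[n][m]
```

Under these assumptions, comparing a strand with itself sums every stacked pair along
the strand. For the strand CCTTTTTTTTTTTTCC this gives 1.84 + 1.28 + 11(1.00) + 1.30 +
1.84 = 17.26.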


4.  Results,  Discussion 

4.1  SynDCode  Inputs 

The  inputs  to  SynDCode  can  be  roughly  divided  into  four  categories:  codeword 
generation  parameters,  code  constraint  parameters,  code  catenation  option(s)  and  code 
extension.  The  code  generation  inputs  are:  "Length  of  DNA  Codewords",  "Initial  Markov 
Probability  Parameter",  "Size  of  Additional  Interval",  "Size  of  Probe  Interval", 
"Undesired  Substrings"  and  "Guanine  Only  in  Complement  Strands".  The  code 
constraint  inputs  are  "Stem  Sizes  to  Check",  "Corresponding  Stem  Thresholds",  "Include 
Stabilizing  Mismatch  Stacked  Pairs",  "Maximum  CH  Duplex  Free  Energy  Upper  Bound" 
and "WC Duplex Free Energy Bounds." The code concatenation option is either turned
on or off. The code extension options allow the user to verify, extend, or verify and
extend an existing code. See Figure 2.


Figure  2:  SynDCodeGUI.  Code  Generation  Windows 

4.1.1  Codeword  Generation  Parameters 

The input "Length of DNA Codewords" allows the user to select the desired length (in
nucleotides) of each codeword and its complement. SynDCode generates a random code,
but it does not necessarily do so by selecting from the uniform distribution. The input
"Initial Markov Probability Parameter" allows the user to set the initial Markov
probability parameter $k$ to be used during candidate codeword generation. SynDCode
generates candidate sequences $x_1, x_2, x_3, \ldots, x_n$ as reported earlier [8]. The parameters
"Undesired Substrings" and "Guanine Only in Complement Strands" are self-explanatory
and were also previously described in [8]. If a user selects the option "Guanine Only in
Complement Strands" then $x$ is generated in a manner similar to the above. That is, the
value of $x_1$ is selected from {A, C, T} with uniform probability. Then, the remaining
entries are generated by a Markov chain given by a similar transition matrix, as in [8],
where only {A, C, T} are considered with parameter $k > 3$.

4.1.2  Code  Constraint  Parameters 

The code constraint parameters are described in terms of $\psi_\Omega^t(x, y)$. Let $\psi^t(x, y)$
denote $\psi_\Omega^t(x, y)$ where $\Omega = 1$ and let $\psi_F^2(x, y)$ be as described above. The input
"Stem Sizes to Check" allows the user to set the values of $t$ for which $\psi^t(x, y)$ will be
considered. The input "Corresponding Stem Thresholds" allows the user to set an upper
bound $s_i$ on the number of $t_i$-stems in a CH duplex. For example, if the user enters 2,
3, 4 (not required to be consecutive or in increasing order) in "Stem Sizes to Check" and
8, 4, 2 in "Corresponding Stem Thresholds" then every CH duplex $x : \bar{y}$ taken from the
generated code will have $\psi^2(x, y) \le 8$, $\psi^3(x, y) \le 4$ and $\psi^4(x, y) \le 2$. The input
"Maximum CH Duplex Free Energy Parameter" allows the user to set an upper bound for
$\psi_F^2(x, y)$. Since $-\psi_F^2(x, y) \le \Delta G(x : \bar{y})$, this allows the user to bound the free
energy of formation of all CH duplexes in the generated code. The "Maximum CH
Midpoint Duplex Free Energy" parameter allows the user to bound the free energy of all
CH duplexes between any strand and any half strand. In addition, for all possible
junctions $x_2y_1$, where $x_2$ is the second half of strand $x$ and $y_1$ is the first half of
strand $y$, and any other strand $z$, the free energy of formation satisfies
$-\psi_F^2(x_2y_1, z) \le 2 \cdot \Delta G_{mid}(x : y)$. The inputs "WC Duplex Free Energy Bounds
(Lower, Upper)" allow the user to ensure that all desired hybridized WC duplexes have a
free energy within a desired range. All WC pairs $x : \bar{x}$ in the generated code will have
$\psi_F^2(x, x) = -\Delta G(x : \bar{x})$ between the selected lower and upper values. For example,
if 17.00 and 22.00 are selected as the lower and upper values then each WC pair
$x : \bar{x}$ in the generated code will have $-22.00 \le \Delta G(x : \bar{x}) \le -17.00$. It is also
known that single internal mismatches oftentimes provide structural stability. See
Figure 1. When the "Include Stabilizing Mismatch Stacked Pairs" option is selected,
SynDCode includes these energetically favorable structures during computation of the
$\psi_F$ measures.
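
The stem-threshold example above (stem sizes 2, 3, 4 with thresholds 8, 4, 2) can be
sketched in Python as follows. This is a minimal illustration, not SynDCode's
implementation: it assumes that the unweighted measure counts one t-stem for every
window of t adjacent matched positions, so that a run of L adjacent matches contributes
max(L - t + 1, 0) t-stems. The names `psi_t` and `satisfies_stem_thresholds` are
hypothetical.

```python
NEG_INF = float("-inf")


def psi_t(x: str, y: str, t: int) -> int:
    """Maximum number of common t-stems over all common subsequences."""
    n, m = len(x), len(y)
    # best[i][j]: best count using only x[:i] and y[:j].
    best = [[0] * (m + 1) for _ in range(n + 1)]
    # run[i][j][k]: best count when x[i-1] is matched with y[j-1] and the
    # current run of adjacent matches has length k (capped at t); k = 0 unused.
    run = [[[NEG_INF] * (t + 1) for _ in range(m + 1)] for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                cur, prev = run[i][j], run[i - 1][j - 1]
                # Start a new run after any earlier structure.
                cur[1] = best[i - 1][j - 1] + (1 if t == 1 else 0)
                # Lengthen a run that is still shorter than t.
                for k in range(2, t):
                    cur[k] = prev[k - 1]
                # Reaching or exceeding length t completes one more t-stem.
                if t > 1:
                    cur[t] = max(prev[t - 1], prev[t]) + 1
            best[i][j] = max(best[i - 1][j], best[i][j - 1], max(run[i][j][1:]))
    return best[n][m]


def satisfies_stem_thresholds(x: str, y: str, thresholds: dict) -> bool:
    """thresholds maps a stem size t to the largest allowed number of t-stems."""
    return all(psi_t(x, y, t) <= bound for t, bound in thresholds.items())
```

With `thresholds = {2: 8, 3: 4, 4: 2}` the check mirrors the example in the text. For a
fixed set of stem sizes the cost per strand pair remains quadratic in the strand length.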

4.1.3  Pseudoknot  Elimination 

SynDCode restricts secondary structure by considering $\psi_F(x, x)$, and also predicts an
upper bound on the optimal minimum free energy for an individual DNA sequence with
any potential pseudoknot secondary structure $x_a x_b : x_{a'} x_{b'}$. In other words, every
pseudoknot contains four distinct sections generally denoted as a, b, a', b'. See Figure 3.


Figure 3: Sequence Partition for Pseudoknot. Blue sections are a and a', red
sections are b and b'. The sequence midpoint is marked by the yellow dot and
the cut point marked by the green dot.

The algorithm operates by splitting the second half of the strand into two distinct
partitions a' and b'. There are n/2 - 1 distinct partitions of the second half of a strand.
Every partition is considered and each is referenced by the length of a', denoted by a
cutpoint value. This pseudoknot free energy upper bound algorithm also has
computational complexity $O(n^2)$. This model is sufficient for the purpose of DNA
code design, where exact structure prediction is not important. The main goal is to ensure
that any pseudoknot formation is so unstable that it will not form. So, this function
predicts the absence of pseudoknot secondary structure. The sections a and b are
dynamically derived in order to produce an upper bound on the free energy of the structure.
In Figure 4, where the cutpoint is 3, a' is assigned to "CAC" and b' to "CTAAGTCGG."
Then, the Watson-Crick complements of a' and b' are generated, concatenated together,
and checked against the unaltered first half of the strand, producing an upper bound on the
free energy of pseudoknot formation. The dot-parenthesis-bracket notation in Figure 4 is
generated by applying a backtracking algorithm after computing the free energy. Any red
open parenthesis represents the structural a section, any red closed parenthesis represents
the a' section, any green closed bracket is part of the b section, and any open bracket is
considered part of the b' section. Any dot represents an unbonded base. The matrix for
the comparison where the cutpoint is 3 is in Figure 5.
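
The cutpoint enumeration described above can be sketched as follows. This is an
illustration under assumptions rather than SynDCode's code: the comparison against the
first half is done with a simple weighted 2-stem dynamic program, the weights in `F` are
assumed to be the magnitudes of the 37 C stacked-pair free energies from Figure 9, and
the function names are hypothetical. For the example strand of Figure 4 the sketch
yields -11.88 at cutpoint 3 and -15.00 at cutpoint 6, which agrees with the values
listed there.

```python
WC = {"A": "T", "C": "G", "G": "C", "T": "A"}
# Assumed 2-stem weights (kcal/mol, magnitudes), indexed by 5'->3' dinucleotide.
F = {
    "AA": 1.00, "TT": 1.00, "AT": 0.88, "TA": 0.58,
    "CA": 1.45, "TG": 1.45, "GT": 1.44, "AC": 1.44,
    "CT": 1.28, "AG": 1.28, "GA": 1.30, "TC": 1.30,
    "CG": 2.17, "GC": 2.24, "GG": 1.84, "CC": 1.84,
}
NEG_INF = float("-inf")


def reverse_complement(s: str) -> str:
    return "".join(WC[b] for b in reversed(s))


def psi_2F(x: str, y: str) -> float:
    """Maximum total weight of common 2-stems of x and y (quadratic time)."""
    n, m = len(x), len(y)
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    end = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                end[i][j] = best[i - 1][j - 1]
                if i > 1 and end[i - 1][j - 1] > NEG_INF:
                    end[i][j] = max(end[i][j], end[i - 1][j - 1] + F[x[i - 2 : i]])
            best[i][j] = max(best[i - 1][j], best[i][j - 1], end[i][j])
    return best[n][m]


def pseudoknot_bounds(strand: str) -> dict:
    """Map each cutpoint (the length of a') to a free energy bound."""
    half = len(strand) // 2
    first, second = strand[:half], strand[half:]
    bounds = {}
    for cut in range(1, half):  # n/2 - 1 partitions of the second half
        a_prime, b_prime = second[:cut], second[cut:]
        # Complement both parts, concatenate, compare with the first half.
        probe = reverse_complement(a_prime) + reverse_complement(b_prime)
        bounds[cut] = -psi_2F(first, probe)
    return bounds
```

A code designer would typically look only at the minimum over all cutpoints and reject
the strand if that value falls below a chosen stability threshold.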


Sequence: AGGTGCCGACAACACCTAAGTCGG

Cutpoint    Thermodynamic free energy
    1              -8.99
    2             -10.44
    3             -11.88
    4             -13.72
    5             -15.00
    6             -15.00
    7             -15.00
    8             -13.55
    9             -12.25
   10             -10.09
   11              -8.25


Figure 4: SynDCode Pseudoknot Alignment Output

An occurrence of the most stable structure appears at cutpoint six, where the free energy
is -15 kcal/mol. The structure is given in Figure 6a.


Figure 6a: Pseudoknot 2-d structure of that
which was linearly represented as cutpoint 6 in
Figure 5.


Figure  6b:  SynDCode  Aligner  GUI 


4.1.4  Example  of  Output  and  Junction  Option 

The strands in Figure 7 are part of a code generated using the above parameters from
Section 4.1.2 when the "Apply Junction Constraint" option is selected. The WC pairs are
labeled $S_i$ and $C_i$.


Encoding                     Reading/Probe
S1  = CCTTTTTTTTTTTTCC       C1  = GGAAAAAAAAAAAAGG
S2  = CCAATTTTCAAAACTT       C2  = AAGTTTTGAAAATTGG
S3  = TTAACCTTCCTTAAAA       C3  = TTTTAAGGAAGGTTAA
S4  = AATCCATTTTTACCAA       C4  = TTGGTAAAAATGGATT
S5  = CAACCAAAAACAAATT       C5  = AATTTGTTTTTGGTTG
S6  = TCCTCTTCCATCCTCC       C6  = GGAGGATGGAAGAGGA
S7  = ACCTCCACTTATTTAA       C7  = TTAAATAAGTGGAGGT
S8  = TACTCACACTTTACAA       C8  = TTGTAAAGTGTGAGTA
S9  = CTCAATTCTTTCTTTT       C9  = AAAAGAAAGAATTGAG
S10 = TAAACACCTATTCATT       C10 = AATGAATAGGTGTTTA
S11 = CTCTAAATCTAAACCT       C11 = AGGTTTAGATTTAGAG
S12 = ATCACAACATATTTCC       C12 = GGAAATATGTTGTGAT

Figure 7: A Coupled DNA Code

Here are the facts about this code:

1.  Each of the 12 WC pairs has a free energy of formation parameter between -22
and -17 kcal per mole (i.e., $-22 \le -\psi_F^2(x, x) \le -17$).

2.  No strand in the code contains the substring GGG or its complement CCC.

3.  In each WC pair only one strand contains a G.

4.  Each of the $\binom{24}{2} - 12$ potential CH pairs $x : y$ with $x \ne y$ has a free energy of
formation parameter that is not below -9 kcal per mole (i.e., $-9 \le -\psi_F^2(x, y)$). Also
each of the 24 CH pairs $x : x$ has a free energy of formation parameter that is not
below -9 kcal per mole.

5.  None of the $\binom{24}{2} - 12$ potential CH pairs has more than eight 2-stems in any
secondary structure (i.e., $\psi^2(x, y) \le 8$).

6.  None of the $\binom{24}{2} - 12$ potential CH pairs has more than four 3-stems in any
secondary structure (i.e., $\psi^3(x, y) \le 4$).

7.  None of the $\binom{24}{2} - 12$ potential CH pairs has more than two 4-stems in any
possible secondary structure (i.e., $\psi^4(x, y) \le 2$).



When the "Apply Junction Constraint" option is selected the above complemented code
satisfies additional constraints which we now describe. Using the encoding strands $S_i$
with $1 \le i \le 12$, the following "combinatorial set" of 64 concatenated strands
$X_1X_2X_3X_4X_5X_6$, where $X_i = S_{2i}$ or $S_{2i-1}$, can be constructed. The combinatorial set
denotes all bit strings of length 6 if we let each $S_{2i-1}$ denote "false" and each $S_{2i}$
denote "true." To reduce crosshybridization potential between these longer concatenated
strands and the probe strands $C_i$, our code satisfies further "catenation constraints."

These catenation constraints largely follow from the "Maximum CH Midpoint Duplex
Free Energy" function. To aid in the discussion of this constraint, consider the following
set of junction sequences depicted in Figure 8.

Encoding                     Junctions
S1  = CCTTTTTT-TTTTTTCC      S1S3, S1S4
S2  = CCAATTTT-CAAAACTT      S2S3, S2S4
S3  = TTAACCTT-CCTTAAAA      S3S5, S3S6
S4  = AATCCATT-TTTACCAA      S4S5, S4S6
S5  = CAACCAAA-AACAAATT      S5S7, S5S8
S6  = TCCTCTTC-CATCCTCC      S6S7, S6S8
S7  = ACCTCCAC-TTATTTAA      S7S9, S7S10
S8  = TACTCACA-CTTTACAA      S8S9, S8S10
S9  = CTCAATTC-TTTCTTTT      S9S11, S9S12
S10 = TAAACACC-TATTCATT      S10S11, S10S12

Figure 8: Junction Sequence of a Coupled DNA Code


The junction sequence $S_iS_j$, where $j = i+1, i+2$ if $i$ is even and $j = i+2, i+3$ if
$i$ is odd, is the second half of sequence $S_i$ followed by the first half of sequence
$S_j$. For example, $S_1S_4$ = TTTTTTCC AATCCATT. For the above code of 12 pairs of
strands, we have a total of 20 such junction strands. If we add these junction strands, we
have a total of 44 strands under consideration. With the catenation constraint selected,
SynDCode generates the 12 pairs of strands in Figure 7 so that not only are
conditions 1 to 7 above satisfied for the strands in Figure 7, but each of the possible
$\binom{44}{2} - 12$ CH duplexes that can be formed from the strands in Table 3 also satisfies the
same CH constraints. Thus the strands in Figure 7 were generated by SynDCode in such
a way that the strands in Figure 7 satisfy conditions 1 to 7 and the strands in Figure 8
satisfy:


8.  Each of the $\binom{44}{2} - 12$ potential CH pairs has a free energy of formation that is not
below -9 kcal per mole (i.e., $-9 \le -\psi_F^2(x, y)$).

9.  None of the $\binom{44}{2} - 12$ potential CH pairs has more than eight 2-stems in any
secondary structure (i.e., $\psi^2(x, y) \le 8$).

10.  None of the $\binom{44}{2} - 12$ potential CH pairs has more than four 3-stems in any
secondary structure (i.e., $\psi^3(x, y) \le 4$).

11.  None of the $\binom{44}{2} - 12$ potential CH pairs has more than two 4-stems in any
possible secondary structure (i.e., $\psi^4(x, y) \le 2$).
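
The combinatorial set and the junction strands discussed in this section can be
enumerated with a few lines of Python. The sketch below assumes that the 12 encoding
strands of Figure 7 form six consecutive blocks (S1/S2, S3/S4, and so on), which is the
reading that yields 64 concatenated strands and 20 junctions; the function names are
chosen for this example only.

```python
from itertools import product

# Encoding strands S1..S12 as listed in Figure 7 (list index 0 holds S1).
S = [
    "CCTTTTTTTTTTTTCC", "CCAATTTTCAAAACTT", "TTAACCTTCCTTAAAA",
    "AATCCATTTTTACCAA", "CAACCAAAAACAAATT", "TCCTCTTCCATCCTCC",
    "ACCTCCACTTATTTAA", "TACTCACACTTTACAA", "CTCAATTCTTTCTTTT",
    "TAAACACCTATTCATT", "CTCTAAATCTAAACCT", "ATCACAACATATTTCC",
]


def combinatorial_set(strands):
    """All concatenations X1...Xk with Xi = S(2i-1) ("false") or S(2i) ("true")."""
    blocks = [(strands[k], strands[k + 1]) for k in range(0, len(strands), 2)]
    return ["".join(choice) for choice in product(*blocks)]


def junctions(strands):
    """Second half of S_i followed by the first half of S_j, keyed by (i, j)."""
    out = {}
    for i in range(1, len(strands) - 1):  # 1-based i; last block has no successor
        first_j = i + 1 if i % 2 == 0 else i + 2  # first strand of the next block
        for j in (first_j, first_j + 1):
            a, b = strands[i - 1], strands[j - 1]
            out[(i, j)] = a[len(a) // 2 :] + b[: len(b) // 2]
    return out
```

Adding the 20 junction strands to the 24 code strands gives the 44 strands over which
the catenation constraints are checked.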

5.  Conclusions


5.1  Self-Assembly 

The basic goal of DNA self-assembly is the autonomous formation of WC duplexes. The
main issue is that when many strands are present in a solution, crosshybridization can
occur, leading to errors during the assembly process. SynDCode can be used to mitigate
the CH problem and also to address the probabilistic and dynamic aspects of hundreds of
strands in solution.

5.1.1  The  Partition  Function 

Suppose we have a collection of strands $\{X_1, X_2, \ldots, X_{2n}\}$ in which every strand has
its WC complement in the collection. For a given $X_i$ we consider the $2n - 1$ possible
duplexes that contain $X_i$ as the possible states that $X_i$ can be in. Let $\Delta G_{ij}$ be the
free energy of formation of the duplex $X_iX_j$. In this application, we think of
$\Delta G_{ij}$ only as a function of temperature. The $\Delta G_{ij}$ of $X_iX_j$ is computed by the
latest version of SynDCode, which allows the user to select the temperature of the assay.
The thermodynamic values of stacked pairs at a given temperature $T$ are computed by
using the Gibbs formula

$$\Delta G = \Delta H - T \Delta S$$

where the values of $\Delta H$ and $\Delta S$ in Figure 9 come from [10].

Propagation    ΔH            ΔS        ΔG at 37 C
sequence       (kcal/mol)    (e.u.)    (kcal/mol)

AA/TT           -7.6         -21.3      -1.00
AT/TA           -7.2         -20.4      -0.88
TA/AT           -7.2         -21.3      -0.58
CA/GT           -8.5         -22.7      -1.45
GT/CA           -8.4         -22.4      -1.44
CT/GA           -7.8         -21.0      -1.28
GA/CT           -8.2         -22.2      -1.30
CG/GC          -10.6         -27.2      -2.17
GC/CG           -9.8         -24.4      -2.24
GG/CC           -8.0         -19.9      -1.84

Figure 9: Enthalpy and Entropy
of Stacked Pairs at 37C
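
As a worked illustration of the Gibbs formula, the following Python sketch recomputes
stacked-pair free energies at a chosen temperature from the enthalpy and entropy values
in Figure 9. It assumes that the entropy unit (e.u.) is cal/(K mol), so the entropy term
is divided by 1000 to obtain kcal/mol; the names `STACKS` and `stack_free_energy` are
chosen for this example.

```python
# (dH in kcal/mol, dS in cal/(K*mol)) for each stacked pair, from Figure 9.
STACKS = {
    "AA/TT": (-7.6, -21.3), "AT/TA": (-7.2, -20.4), "TA/AT": (-7.2, -21.3),
    "CA/GT": (-8.5, -22.7), "GT/CA": (-8.4, -22.4), "CT/GA": (-7.8, -21.0),
    "GA/CT": (-8.2, -22.2), "CG/GC": (-10.6, -27.2), "GC/CG": (-9.8, -24.4),
    "GG/CC": (-8.0, -19.9),
}


def stack_free_energy(pair: str, temp_kelvin: float) -> float:
    """Gibbs free energy dG = dH - T*dS of one stacked pair, in kcal/mol."""
    d_h, d_s = STACKS[pair]
    return d_h - temp_kelvin * d_s / 1000.0  # convert cal to kcal


if __name__ == "__main__":
    for pair in STACKS:
        print(pair, round(stack_free_energy(pair, 310.15), 2))
```

At 310.15 K (37 C) the computed values agree with the last column of Figure 9 to within
about 0.02 kcal/mol; lowering the temperature makes every stacked pair more stable.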

The probability that $X_i$ is paired with $X_{j_0}$ is given by the partition function

$$\frac{\exp(-\Delta G_{ij_0}/RT)}{\exp(-\Delta G_{ij_0}/RT) + \sum_{j \ne j_0} \exp(-\Delta G_{ij}/RT)}.$$
j*J0 

To model self-assembly, two quantities are important: the probability that, in an
ensemble of strands, each strand is paired with its complement, and how long it takes to
reach that state. This computational and simulation software is now part of SynDCode.
SynDCode can be used to select the optimal temperature at which the self-assembly assay
should occur.

For example, consider a DNA code with $2n$ strands. If the WC free energy satisfies
$-\Delta G_{WC} \ge w$ and the CH free energy satisfies $-\Delta G_{CH} \le c$, then given that each
object has access to each state, the probability that each strand will find its complement
in an ensemble is at least

$$\frac{\exp(w/RT)}{\exp(w/RT) + (n-1)\exp(c/RT)}.$$

Applying this analysis to the code in Figure 7 (first 10 pairs), which has
$-\Delta G_{WC} \ge 17$ and $-\Delta G_{CH} \le 9$ (at 310 K), the probability that each strand pairs
with its complement in the total ensemble is over 99.99%. The self-assembly hybridization
of this code was simulated using SynDCode. See Figure 10.
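
The pairing probability and the lower bound above can be evaluated numerically with the
sketch below. It assumes free energies in kcal/mol, the gas constant R = 1.987e-3
kcal/(mol K), and Boltzmann weights of the form exp(-ΔG/RT); the function names are
chosen for this example.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal/(mol*K)


def pairing_probabilities(delta_g, temp_kelvin):
    """Normalised Boltzmann weights exp(-dG/RT) over all candidate duplexes."""
    rt = R_KCAL * temp_kelvin
    weights = [math.exp(-g / rt) for g in delta_g]
    total = sum(weights)
    return [w / total for w in weights]


def wc_probability_lower_bound(w, c, n, temp_kelvin):
    """exp(w/RT) / (exp(w/RT) + (n-1) exp(c/RT)), rearranged for stability."""
    rt = R_KCAL * temp_kelvin
    return 1.0 / (1.0 + (n - 1) * math.exp((c - w) / rt))


if __name__ == "__main__":
    # First 10 pairs of Figure 7: w = 17, c = 9 at 310 K.
    print(wc_probability_lower_bound(17.0, 9.0, 10, 310.0))
```

With w = 17, c = 9, n = 10 and T = 310 K the bound evaluates to slightly more than
0.9999, in line with the figure quoted in the text. Repeating the call over a range of
temperatures is one simple way to explore the temperature choice mentioned above.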


Figure  10:  Epochal  Self-Assembly  Simulation.  Each 
encoding  S  strand  is  accompanied  with  its  complement 
probe  strand  C  in  its  right-hand-side  adjacent  column.  At  the 
end  of  each  epoch  (row),  every  strand  in  the  pool  is  either 
free  (colored  green),  cross-hybridized  bonded  (red),  or 
Watson-Crick  bonded  (blue). 


5.1.2  Self-Assembly  Simulation 

Figure 10 illustrates the hybridization of 10 distinct DNA Watson-Crick pairs in silico,
with each column representing one strand and each row representing one epoch. Each
encoding S strand is accompanied by its complement probe strand C in the adjacent
column on its right-hand side. For example, S4 is addressed to column 7 with C4
addressed to column 8. At the end of each epoch, every strand in the pool is either free
(colored green), cross-hybridized bonded (red), or Watson-Crick bonded (blue). During
each epoch two strands are selected for potential hybridization. Here, there are three
distinct interactions that may occur. The first interaction involves selecting two free
strands, the second involves selecting one free and one bonded strand, and the last
involves selecting two bonded strands. SynDCode decides which is most stable. If both
strands are free, they will become bonded and form a duplex. If any duplex consists of a
Watson-Crick pair, it will remain bonded during the remaining epochs. If one strand is
free and the other bonded, the free strand will melt the current duplex if the free energy
between the free strand and one of the bonded strands is less than the free energy of the
existing duplex. This procedure produces a new, more stable duplex and one free strand.
If both strands are bonded, both existing duplexes will melt if the free energy between
one of the strands in one duplex and one of the strands in the other is less than the free
energy of both current duplexes. A new, more stable duplex is then formed and the other
two strands become free. At each epoch at most one of these procedures may occur. The
program halts when all strands reside in Watson-Crick duplexes and the code is
self-assembled.
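
The epoch rules described above can be sketched as a small Python simulation. This is
an illustration only and not SynDCode's simulator: the text does not say how the two
strands are selected, so uniform random selection is assumed; strands are represented by
integers with strands 2k and 2k+1 taken to be WC complements; and `toy_energy` is a
hypothetical energy function that returns -17 for a WC pair and -9 for every CH pair.

```python
import random


def toy_energy(a: int, b: int) -> float:
    """Hypothetical free energies: -17 for a WC pair, -9 for any CH pair."""
    return -17.0 if a // 2 == b // 2 else -9.0


def simulate_self_assembly(num_pairs, energy, seed=0, max_epochs=100_000):
    """Return the number of epochs until every strand is in its WC duplex."""
    rng = random.Random(seed)
    strands = list(range(2 * num_pairs))  # strands 2k and 2k+1 are complements
    partner = {s: None for s in strands}  # None means the strand is free

    def is_wc(a, b):
        return a // 2 == b // 2

    for epoch in range(1, max_epochs + 1):
        a, b = rng.sample(strands, 2)
        if partner[a] != b:  # ignore two strands that already share a duplex
            group_a = [a] if partner[a] is None else [a, partner[a]]
            group_b = [b] if partner[b] is None else [b, partner[b]]
            duplexes = [g for g in (group_a, group_b) if len(g) == 2]
            # A WC duplex stays bonded for all remaining epochs.
            if not any(is_wc(*d) for d in duplexes):
                x, y = min(
                    ((p, q) for p in group_a for q in group_b),
                    key=lambda pq: energy(*pq),
                )
                # Two free strands always bond; otherwise the new duplex must
                # be more stable (lower energy) than every duplex it melts.
                if all(energy(x, y) < energy(*d) for d in duplexes):
                    for p, q in duplexes:
                        partner[p] = partner[q] = None
                    partner[x], partner[y] = y, x
        if all(p is not None and is_wc(s, p) for s, p in partner.items()):
            return epoch
    raise RuntimeError("no full self-assembly within max_epochs")
```

Calling `simulate_self_assembly(10, toy_energy, seed=1)` runs the 10-pair case. Because
at most one duplex forms per epoch, at least 10 epochs are always needed; replacing
`toy_energy` with sequence-based energies would make the run depend on the actual code.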

5.2  Comparisons  to  SLSDesigner 

One other well known DNA code generation software suite is SLSDesigner, produced by
the Beta Lab, University of British Columbia. SLSDesigner uses a genetic algorithm
with the Beta Lab's Pairfold DNA thermodynamic modeler as its objective function
[3]. The computational complexity of Pairfold is $O(n^3)$, where $n$ is the length of the
strand(s) under consideration. SynDCode has complexity $O(n^2)$. Figure 11 shows the
increased efficiency of SynDCode. Computationally more complex algorithms give more
accurate pairwise computations, but pairwise accuracy is not necessarily the most
important consideration when the primary objective is maximizing the size of a code for
global thresholds. See Sections 5.2 and 5.3.


[Plot: Length vs. Algorithmic Cost; series: SynDCode, Pairfold]

Figure 11: SynDCode and Pairfold computational time for ~1
million strand pair comparisons


5.2.1  Regression  of  SynDCode  to  Pairfold 

Using SynDCode's $\psi_F^2(x, y)$, $\psi^2(x, y)$, $\psi^3(x, y)$ and $\psi^4(x, y)$, a multiple linear
regression was constructed with Pairfold(x, y) against $X = \psi_F^2(x, y)$,
$Y = \psi^2(x, y) - \psi^3(x, y)$ and $Z = \psi^3(x, y) - \psi^4(x, y)$. The accuracy and correlation
were very good (multiple R = 0.785 over 44,700 observations). See Figures 12 and 13.


Regression Statistics
Multiple R                  0.785477
R Square                    0.616974
Adjusted R Square           0.616948
Standard Percent Error     -0.05
Absolute Percent Error      0.18
Observations                44700

                         Coefficients     Standard Error
Intercept                -4.050290292     0.023840939
Weighted 2-Stems = X      0.76579965      0.002948573
2'-Stem Score = Y        -0.825263954     0.006985618
3'-Stem Score = Z        -0.736910737     0.009371336

Figure 12: Pairfold vs. SynDCode(X,Y,Z)


$$\text{Pairfold} \approx 0.77X - 0.83Y - 0.74Z - 4.05$$


[Plot: x-axis SynDCode (X, Y, Z); series: Pairfold, Predicted Pairfold]

Figure 13: Pairfold vs. SynDCode multiple regression plot
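
For illustration, the fitted regression can be evaluated with a one-line Python
function. The coefficients are copied from Figure 12; the function name and argument
names are chosen for this example, and the inputs X, Y and Z are assumed to have been
computed beforehand from the stem measures.

```python
def predicted_pairfold(x_weighted_2stems: float,
                       y_2stem_score: float,
                       z_3stem_score: float) -> float:
    """Linear fit from Figure 12: intercept plus one coefficient per regressor."""
    return (-4.050290292
            + 0.76579965 * x_weighted_2stems
            - 0.825263954 * y_2stem_score
            - 0.736910737 * z_3stem_score)


if __name__ == "__main__":
    # Arbitrary illustrative inputs, not measured data.
    print(predicted_pairfold(10.0, 2.0, 1.0))
```

Since the fit explains roughly 62% of the variance (R Square of 0.617), such a
prediction is best treated as a fast screening estimate rather than a substitute for a
full Pairfold computation.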

5.2.2  Code  Generation:  SynDCode  vs.  SLSDesigner 

Using  a  combination  of  our  regression  fit  and  greedy  parsing  algorithm,  codes  satisfying 
Pairfold  constraints  were  generated  using  SynDCode  and  SLSDesigner.  SynDCode's 
Markov  generation  method  and  increased  speed  performed  much  better  than 
SLSDesigner's  stochastic  search  algorithm  in  the  more  computationally  complex  Pairfold 
program.  See  Figure  14. 

[Plot: SynDCode v. SlsDnaDesigner]

Figure 14: Comparison of DNA Code Generation Performance of
SynDCode and SLSDesigner